Problem Definition

Kira Plastinina is a Russian brand that is sold through a defunct chain of retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, Philippines, and Armenia. The brand’s Sales and Marketing team would like to understand their customer’s behavior from data that they have collected over the past year. More specifically, they would like to learn the characteristics of customer groups.

a)Specifying the Question

  1. Perform clustering stating insights drawn from your analysis and visualizations.

  2. Provide comparisons between the approaches learned this week i.e. K-Means clustering vs Hierarchical clustering highlighting the strengths and limitations of each approach in the context of your analysis.

b)The metric of Success

Be able to understand the characteristics of Kira Plastinina Customers and Classify the into by using unsupervised machine learning algorithms

c)The Context

Study of online shoppers for specifically the Kira Plastinina brand

d)Experimental Design

  1. Data Sourcing
  2. Check the Data
  3. Perform Data Cleaning
  4. Analyze the Data

Data Sourcing

shoppers <- read.csv("~/Documents/R markdowns/IP Week 13/online_shoppers_intention.csv")

head(shoppers)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                      -1             0                     -1
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1              1                0.000000  0.20000000 0.2000000          0
## 2              2               64.000000  0.00000000 0.1000000          0
## 3              1               -1.000000  0.20000000 0.2000000          0
## 4              2                2.666667  0.05000000 0.1400000          0
## 5             10              627.500000  0.02000000 0.0500000          0
## 6             19              154.216667  0.01578947 0.0245614          0
##   SpecialDay Month OperatingSystems Browser Region TrafficType
## 1          0   Feb                1       1      1           1
## 2          0   Feb                2       2      1           2
## 3          0   Feb                4       1      9           3
## 4          0   Feb                3       2      2           4
## 5          0   Feb                3       3      1           4
## 6          0   Feb                2       2      1           3
##         VisitorType Weekend Revenue
## 1 Returning_Visitor   FALSE   FALSE
## 2 Returning_Visitor   FALSE   FALSE
## 3 Returning_Visitor   FALSE   FALSE
## 4 Returning_Visitor   FALSE   FALSE
## 5 Returning_Visitor    TRUE   FALSE
## 6 Returning_Visitor   FALSE   FALSE

Check the Data

# check the structure of Data
str(shoppers)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 1 2 3 ...
##  $ ProductRelated_Duration: num  0 64 -1 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : chr  "Feb" "Feb" "Feb" "Feb" ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : chr  "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
dim(shoppers)
## [1] 12330    18
summary(shoppers)
##  Administrative   Administrative_Duration Informational   
##  Min.   : 0.000   Min.   :  -1.00         Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.000  
##  Median : 1.000   Median :   8.00         Median : 0.000  
##  Mean   : 2.318   Mean   :  80.91         Mean   : 0.504  
##  3rd Qu.: 4.000   3rd Qu.:  93.50         3rd Qu.: 0.000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.000  
##  NA's   :14       NA's   :14              NA's   :14      
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :  -1.00        Min.   :  0.00   Min.   :   -1.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  185.0        
##  Median :   0.00        Median : 18.00   Median :  599.8        
##  Mean   :  34.51        Mean   : 31.76   Mean   : 1196.0        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1466.5        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##  NA's   :14             NA's   :14       NA's   :14             
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003119   Median :0.02512   Median :  0.000   Median :0.00000  
##  Mean   :0.022152   Mean   :0.04300   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016684   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##  NA's   :14         NA's   :14                                           
##     Month           OperatingSystems    Browser           Region     
##  Length:12330       Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mode  :character   Median :2.000    Median : 2.000   Median :3.000  
##                     Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##                     3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##                     Max.   :8.000    Max.   :13.000   Max.   :9.000  
##                                                                      
##   TrafficType    VisitorType         Weekend         Revenue       
##  Min.   : 1.00   Length:12330       Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Class :character   FALSE:9462      FALSE:10422    
##  Median : 2.00   Mode  :character   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                     
##  3rd Qu.: 4.00                                                     
##  Max.   :20.00                                                     
## 
# convert all negative values to NA

shoppers[shoppers<0] <- NA
 # convert into a factor
shoppers$VisitorType <- factor(shoppers$VisitorType)
head(shoppers$VisitorType)
## [1] Returning_Visitor Returning_Visitor Returning_Visitor Returning_Visitor
## [5] Returning_Visitor Returning_Visitor
## Levels: New_Visitor Other Returning_Visitor
shoppers$Weekend <- factor(shoppers$Weekend)
head(shoppers$Weekend)
## [1] FALSE FALSE FALSE FALSE TRUE  FALSE
## Levels: FALSE TRUE
shoppers$Revenue <- factor(shoppers$Revenue)
head(shoppers$Revenue)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## Levels: FALSE TRUE
shoppers$Month <- factor(shoppers$Month)
head(shoppers$Month)
## [1] Feb Feb Feb Feb Feb Feb
## Levels: Aug Dec Feb Jul June Mar May Nov Oct Sep

Perform Data Cleaning

#check for missing values
colSums(is.na(shoppers))
##          Administrative Administrative_Duration           Informational 
##                      14                      47                      14 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                      47                      14                      47 
##             BounceRates               ExitRates              PageValues 
##                      14                      14                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0
anyNA(shoppers)
## [1] TRUE
#library("Amelia")
#missmap(shoppers)

# use MICE to predict missing values
library("mice")
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
mice_mod <- mice(shoppers[, c("Administrative","Administrative_Duration","Informational","Informational_Duration","ProductRelated","ProductRelated","ProductRelated_Duration","BounceRates","ExitRates")], method='rf')
## 
##  iter imp variable
##   1   1  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   1   2  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   1   3  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   1   4  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   1   5  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   2   1  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   2   2  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   2   3  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   2   4  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   2   5  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   3   1  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   3   2  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   3   3  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   3   4  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   3   5  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   4   1  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   4   2  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   4   3  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   4   4  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   4   5  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   5   1  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   5   2  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   5   3  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   5   4  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
##   5   5  Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates
## Warning: Number of logged events: 1
mice_complete <- complete(mice_mod)

shoppers$Administrative <- mice_complete$Administrative
shoppers$Administrative_Duration  <- mice_complete$Administrative_Duration
shoppers$Informational <- mice_complete$Informational
shoppers$Informational_Duration <- mice_complete$Informational_Duration
shoppers$ProductRelated <- mice_complete$ProductRelated
shoppers$ProductRelated_Duration <- mice_complete$ProductRelated_Duration
shoppers$BounceRates <- mice_complete$BounceRates
shoppers$ExitRates <- mice_complete$ExitRates
# Check for any missing values
anyNA(shoppers)
## [1] FALSE
shoppers <- unique(shoppers)
dim(shoppers)
## [1] 12205    18
num_col <- shoppers[ ,c(1,2,3,4,5,6,7,8,9,10,12,13,14,15)]

outlier_detection = function(x){
  for(i in colnames(x)){
    boxplot(shoppers[[i]], xlab=i, main=paste0("Boxplot for ",i))
  }
}

outlier_detection(num_col)

# replace outliers with the 5th and 95th percentile
outlier_replace <- function(x){
   qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
   caps <- quantile(x, probs=c(.05, .95), na.rm = T)
   H <- 1.5 * IQR(x, na.rm = T)
   x[x < (qnt[1] - H)] <- caps[1]
   x[x > (qnt[2] + H)] <- caps[2]
   return(x)
}
shoppers$Administrative <- outlier_replace(shoppers$Administrative)
shoppers$Administrative_Duration <-outlier_replace(shoppers$Administrative_Duration)
shoppers$Informational <- outlier_replace(shoppers$Informational)
shoppers$Informational_Duration <- outlier_replace(shoppers$Informational_Duration )
shoppers$ProductRelated <- outlier_replace(shoppers$ProductRelated)
shoppers$ProductRelated_Duration <- outlier_replace(shoppers$ProductRelated_Duration)
shoppers$BounceRates <- outlier_replace(shoppers$BounceRates)
shoppers$ExitRates <- outlier_replace(shoppers$ExitRates)
shoppers$PageValues <- outlier_replace(shoppers$PageValues)
shoppers$SpecialDay <- outlier_replace(shoppers$SpecialDay)
shoppers$OperatingSystems <- outlier_replace(shoppers$OperatingSystems)
shoppers$Browser <- outlier_replace(shoppers$Browser)
shoppers$Region <- outlier_replace(shoppers$Region)
shoppers$TrafficType <- outlier_replace(shoppers$TrafficType)

# check to see if there are more outliers
outlier_detection(num_col)

From our Analysis we see that we have a total of 12330 records with 18 columns, 10 columns are numeric and 8 columns are categorical

Columns that had discrete values and are categorical like:- a) Months b) Weekend c) Revenue d) Visitor Type

They were converted to factors and all the numeric columns with negative values were replace with NA. The dataset had Missing values and they were replaced by the library MICE that predict the missing values using random forest.

There were duplicated values and there were dropped, reducing the records to 12210 records and 18 columns.

The numeric columns had outliers and capping was used by replacing the outliers with the 5th and 95th percentile, although some outliers remained. The outliers were not removed.

Perform Exploratory Data Analysis

Univariate

Categorical Variables

#### Numeric Variables

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(num_col, skew=TRUE)
## num_col 
## 
##  14  Variables      12205  Observations
## --------------------------------------------------------------------------------
## Administrative 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0       27    0.898     2.34    3.193        0        0 
##      .25      .50      .75      .90      .95 
##        0        1        4        7        9 
## 
## lowest :  0  1  2  3  4, highest: 22 23 24 26 27
## --------------------------------------------------------------------------------
## Administrative_Duration 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0     3335    0.894    81.65    125.7      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      9.0     94.7    227.0    352.2 
## 
## lowest :    0.000000    1.333333    2.000000    3.000000    3.500000
## highest: 2407.423810 2629.253968 2657.318056 2720.500000 3398.750000
## --------------------------------------------------------------------------------
## Informational 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0       17    0.516   0.5086   0.8759        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        2        3 
## 
## lowest :  0  1  2  3  4, highest: 12 13 14 16 24
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency   9575  1041   727   380   222    99    78    36    14    15     7
## Proportion 0.785 0.085 0.060 0.031 0.018 0.008 0.006 0.003 0.001 0.001 0.001
##                                               
## Value         11    12    13    14    16    24
## Frequency      1     5     1     2     1     1
## Proportion 0.000 0.000 0.000 0.000 0.000 0.000
## --------------------------------------------------------------------------------
## Informational_Duration 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0     1258    0.482    34.82    64.67      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0      0.0     73.9    199.0 
## 
## lowest :    0.000    1.000    1.500    2.000    2.500
## highest: 2166.500 2195.300 2252.033 2256.917 2549.375
## --------------------------------------------------------------------------------
## ProductRelated 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0      311    0.999    32.07     36.5        2        3 
##      .25      .50      .75      .90      .95 
##        8       18       38       74      110 
## 
## lowest :   0   1   2   3   4, highest: 518 534 584 686 705
## --------------------------------------------------------------------------------
## ProductRelated_Duration 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0     9551        1     1208     1491     0.00    45.08 
##      .25      .50      .75      .90      .95 
##   194.23   610.19  1477.39  2904.46  4312.96 
## 
## lowest :     0.000000     0.500000     1.000000     2.333333     2.666667
## highest: 24844.156200 27009.859430 29970.465970 43171.233380 63973.522230
## --------------------------------------------------------------------------------
## BounceRates 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0     1872    0.908  0.02034  0.03271 0.000000 0.000000 
##      .25      .50      .75      .90      .95 
## 0.000000 0.002899 0.016667 0.050000 0.145905 
## 
## lowest : 0.0000000 0.0000273 0.0000335 0.0000383 0.0000394
## highest: 0.1750000 0.1769231 0.1800000 0.1833333 0.2000000
## --------------------------------------------------------------------------------
## ExitRates 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0     4777        1  0.04141  0.04219 0.004545 0.007407 
##      .25      .50      .75      .90      .95 
## 0.014226 0.025000 0.048447 0.100000 0.170893 
## 
## lowest : 0.000000000 0.000175593 0.000250438 0.000262123 0.000263158
## highest: 0.183333333 0.186666667 0.188888889 0.192307692 0.200000000
## --------------------------------------------------------------------------------
## PageValues 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0     2704    0.532     5.95    10.65     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     0.00     0.00    19.11    38.30 
## 
## lowest :   0.00000000   0.03803454   0.06704955   0.09354695   0.09862140
## highest: 261.49128570 270.78469310 287.95379280 360.95338390 361.76374190
## --------------------------------------------------------------------------------
## SpecialDay 
##        n  missing distinct     Info     Mean      Gmd 
##    12205        0        6    0.277  0.06194   0.1141 
## 
## lowest : 0.0 0.2 0.4 0.6 0.8, highest: 0.2 0.4 0.6 0.8 1.0
##                                               
## Value        0.0   0.2   0.4   0.6   0.8   1.0
## Frequency  10956   178   243   350   324   154
## Proportion 0.898 0.015 0.020 0.029 0.027 0.013
## --------------------------------------------------------------------------------
## OperatingSystems 
##        n  missing distinct     Info     Mean      Gmd 
##    12205        0        8    0.828    2.124   0.8613 
## 
## lowest : 1 2 3 4 5, highest: 4 5 6 7 8
##                                                           
## Value          1     2     3     4     5     6     7     8
## Frequency   2549  6541  2530   478     6    19     7    75
## Proportion 0.209 0.536 0.207 0.039 0.000 0.002 0.001 0.006
## --------------------------------------------------------------------------------
## Browser 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0       13    0.722    2.358    1.315        1        1 
##      .25      .50      .75      .90      .95 
##        2        2        2        4        5 
## 
## lowest :  1  2  3  4  5, highest:  9 10 11 12 13
##                                                                             
## Value          1     2     3     4     5     6     7     8     9    10    11
## Frequency   2427  7883   105   731   465   174    49   135     1   163     6
## Proportion 0.199 0.646 0.009 0.060 0.038 0.014 0.004 0.011 0.000 0.013 0.000
##                       
## Value         12    13
## Frequency     10    56
## Proportion 0.001 0.005
## --------------------------------------------------------------------------------
## Region 
##        n  missing distinct     Info     Mean      Gmd 
##    12205        0        9    0.933    3.153    2.553 
## 
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##                                                                 
## Value          1     2     3     4     5     6     7     8     9
## Frequency   4714  1128  2379  1171   318   801   758   431   505
## Proportion 0.386 0.092 0.195 0.096 0.026 0.066 0.062 0.035 0.041
## --------------------------------------------------------------------------------
## TrafficType 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12205        0       20    0.954    4.074    3.735        1        1 
##      .25      .50      .75      .90      .95 
##        2        2        4       11       13 
## 
## lowest :  1  2  3  4  5, highest: 16 17 18 19 20
##                                                                             
## Value          1     2     3     4     5     6     7     8     9    10    11
## Frequency   2388  3911  2013  1066   260   443    40   343    41   450   247
## Proportion 0.196 0.320 0.165 0.087 0.021 0.036 0.003 0.028 0.003 0.037 0.020
##                                                                 
## Value         12    13    14    15    16    17    18    19    20
## Frequency      1   728    13    37     3     1    10    17   193
## Proportion 0.000 0.060 0.001 0.003 0.000 0.000 0.001 0.001 0.016
## --------------------------------------------------------------------------------
histogram = function(x){
  for(i in colnames(x)){
    hist(shoppers[[i]], breaks = 10,main =i,xlab = i,col = "dodgerblue")
  }
}

histogram(num_col)

### Bivariate #### Correlation

# Convert the Revenue Col to Numeric
shoppers$Revenue <- as.numeric(shoppers$Revenue)
nums <- shoppers[ ,c(1,2,3,4,5,6,7,8,9,10,12,13,14,15,18)]

# Correlation matrix
corr_matrix = cor(nums)
corr <- as.data.frame(round(corr_matrix,2))
corr
##                         Administrative Administrative_Duration Informational
## Administrative                    1.00                    0.77          0.37
## Administrative_Duration           0.77                    1.00          0.32
## Informational                     0.37                    0.32          1.00
## Informational_Duration            0.37                    0.32          0.94
## ProductRelated                    0.44                    0.33          0.38
## ProductRelated_Duration           0.39                    0.34          0.37
## BounceRates                      -0.25                   -0.20         -0.14
## ExitRates                        -0.35                   -0.29         -0.20
## PageValues                        0.35                    0.30          0.23
## SpecialDay                       -0.11                   -0.10         -0.05
## OperatingSystems                  0.00                   -0.01          0.00
## Browser                          -0.03                   -0.04         -0.03
## Region                            0.00                    0.01         -0.02
## TrafficType                      -0.03                   -0.02         -0.03
## Revenue                           0.14                    0.13          0.11
##                         Informational_Duration ProductRelated
## Administrative                            0.37           0.44
## Administrative_Duration                   0.32           0.33
## Informational                             0.94           0.38
## Informational_Duration                    1.00           0.37
## ProductRelated                            0.37           1.00
## ProductRelated_Duration                   0.37           0.85
## BounceRates                              -0.14          -0.25
## ExitRates                                -0.21          -0.37
## PageValues                                0.24           0.34
## SpecialDay                               -0.05          -0.03
## OperatingSystems                          0.00           0.03
## Browser                                  -0.03           0.00
## Region                                   -0.01          -0.04
## TrafficType                              -0.03          -0.05
## Revenue                                   0.11           0.17
##                         ProductRelated_Duration BounceRates ExitRates
## Administrative                             0.39       -0.25     -0.35
## Administrative_Duration                    0.34       -0.20     -0.29
## Informational                              0.37       -0.14     -0.20
## Informational_Duration                     0.37       -0.14     -0.21
## ProductRelated                             0.85       -0.25     -0.37
## ProductRelated_Duration                    1.00       -0.24     -0.34
## BounceRates                               -0.24        1.00      0.78
## ExitRates                                 -0.34        0.78      1.00
## PageValues                                 0.34       -0.19     -0.26
## SpecialDay                                -0.05        0.14      0.13
## OperatingSystems                           0.03        0.04      0.02
## Browser                                    0.01       -0.03     -0.01
## Region                                    -0.02       -0.01     -0.01
## TrafficType                               -0.05        0.09      0.07
## Revenue                                    0.18       -0.16     -0.21
##                         PageValues SpecialDay OperatingSystems Browser Region
## Administrative                0.35      -0.11             0.00   -0.03   0.00
## Administrative_Duration       0.30      -0.10            -0.01   -0.04   0.01
## Informational                 0.23      -0.05             0.00   -0.03  -0.02
## Informational_Duration        0.24      -0.05             0.00   -0.03  -0.01
## ProductRelated                0.34      -0.03             0.03    0.00  -0.04
## ProductRelated_Duration       0.34      -0.05             0.03    0.01  -0.02
## BounceRates                  -0.19       0.14             0.04   -0.03  -0.01
## ExitRates                    -0.26       0.13             0.02   -0.01  -0.01
## PageValues                    1.00      -0.07            -0.01    0.02  -0.01
## SpecialDay                   -0.07       1.00             0.02    0.01  -0.02
## OperatingSystems             -0.01       0.02             1.00    0.14   0.01
## Browser                       0.02       0.01             0.14    1.00   0.05
## Region                       -0.01      -0.02             0.01    0.05   1.00
## TrafficType                  -0.03       0.04             0.10    0.00   0.01
## Revenue                       0.60      -0.09            -0.02    0.02  -0.01
##                         TrafficType Revenue
## Administrative                -0.03    0.14
## Administrative_Duration       -0.02    0.13
## Informational                 -0.03    0.11
## Informational_Duration        -0.03    0.11
## ProductRelated                -0.05    0.17
## ProductRelated_Duration       -0.05    0.18
## BounceRates                    0.09   -0.16
## ExitRates                      0.07   -0.21
## PageValues                    -0.03    0.60
## SpecialDay                     0.04   -0.09
## OperatingSystems               0.10   -0.02
## Browser                        0.00    0.02
## Region                         0.01   -0.01
## TrafficType                    1.00    0.00
## Revenue                        0.00    1.00
admin <- shoppers$Administrative
admin_d <- shoppers$Administrative_Duration
info <- shoppers$Informational
info_d <- shoppers$Informational_Duration
prod <- shoppers$ProductRelated
prod_d <- shoppers$ProductRelated_Duration
exit <- shoppers$ExitRates
bounce <- shoppers$BounceRates
page <- shoppers$PageValues

# Administrative
plot(admin,exit,main = "Relationship between the Administrative and Exit Rate", xlab="Administrative", ylab="Exit Rate")

plot(admin,bounce,main = "Relationship between the Admintrative and Bounce Rate",xlab="Administrative",ylab = "Bounce Rate")

plot(admin,page, main = "Relationship between the Admintrative and Page Values",xlab="Administrative",ylab = "Page Value")

plot(admin, admin_d = "Relationship between the Administrative and Administrative Duration",xlab="Administrative",ylab = "Administrative Duration")
## Warning in plot.window(...): "admin_d" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "admin_d" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "admin_d" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "admin_d" is not a
## graphical parameter
## Warning in box(...): "admin_d" is not a graphical parameter
## Warning in title(...): "admin_d" is not a graphical parameter

# Informational
plot(info,exit,main = "Relationship between the Informational and Exit Rate",xlab = "Informational",ylab = "Exit Rates")

plot(info,bounce,main = "Relationship between the Informational and Bounce Rate",xlab = "Informational", ylab = "Bounce Rates")

plot(info,page, main = "Relationship between the Informational and Page Values",xlab = "Informational",ylab = "Page Values")

plot(info, info_d, main = "Relationship between the Informational and Informational Duration", xlab = "Informational",ylab = "Informational Duration")

# Product Related
plot(prod,exit,main = "Relationship between the Product Related and Exit Rate",xlab="Product Related",ylab = "Exit Rates")

plot(prod,bounce,main = "Relationship between the Product Related and Bounce Rate",xlab="Product Related",ylab = "Bounce Rates")

plot(prod,page, main = "Relationship between the Product Related and Page Values",xlab="Product Related",ylab = "Page Values")

plot(prod, prod_d, main = "Relationship between the Product Related and ProductRelated Duration",xlab="Product Related",ylab ="Product Related Duration")

#Duration
plot(admin_d,info_d, main = "Relationship between Admin Duration and Informational Duration", xlab = "Administrative Duration", ylab = "Informational Duration")

plot(prod_d, info_d,main = "Relationship between ProductRelated Duration and Informational Duration", xlab = "Product Related Duration",ylab = "Informational Duration")

plot(admin_d,prod_d,main = "Relationship between Admin Duration and ProductRelated Duration", xlab = "Administrative Duration", ylab = "Product Related Duration")

library(dplyr)
# how much page values do we have on weekends
by_weekend <- shoppers %>%
  group_by(Weekend) %>%
  summarise(pagevalues = sum(PageValues))

pv <- ggplot(by_weekend,aes(x=Weekend,y=pagevalues, fill=factor(Weekend)))+geom_bar(stat = "identity")+scale_fill_discrete(name = "Weekend", labels = c("Not Weekend","Weekend"))+ labs(title="Page Values do we have on Weekends", x="Weekend",y="Total value of Page Values")
pv 

# Pages with the Highest Exit rate
## Administrative
by_admin <- shoppers %>%
  group_by(Administrative)%>%
  summarise(Exit = sum(ExitRates))

ad <- ggplot(by_admin,aes(x=factor(Administrative),y=Exit, fill=factor(Administrative)))+geom_bar(stat = "identity")+ labs(title="Cumulative % of Exit Rate in Administratuve", x="Administrative Pages",y=" Cumulative % of Exit Rate per Page")
ad+theme(legend.position = "none")

## Informational
by_info <- shoppers %>%
  group_by(Informational)%>%
  summarise(Exit = sum(ExitRates))

inf <- ggplot(by_info,aes(x=factor(Informational),y=Exit, fill=factor(Informational)))+geom_bar(stat = "identity")+ labs(title="Cumulative % of Exit Rate in Informational", x="Informational Pages",y="Cumulative % of Exit Rateper Page")
inf+theme(legend.position = "none")

## Product Related
by_prod <- shoppers %>%
  group_by(ProductRelated)%>%
  summarise(Exit = sum(ExitRates))

pro <- ggplot(by_prod[1:20,],aes(x=reorder(factor(ProductRelated),Exit),y=Exit, fill=factor(ProductRelated)))+geom_bar(stat = "identity")+ labs(title="Cumulative % of Exit Rate in Product Related", x="Product Related Pages",y="Cumulative % of Exit Rate per Page")
pro+theme(legend.position = "none")

# Pages with the Highest Bounce rate
## Administrative
by_admin1 <- shoppers %>%
  group_by(Administrative)%>%
  summarise(Bounce = sum(BounceRates))

ad1 <- ggplot(by_admin1,aes(x=factor(Administrative),y=Bounce, fill=factor(Administrative)))+geom_bar(stat = "identity")+ labs(title="Cumulative % of  Bounce Rate in Administrative", x="Administrative Pages",y=" Cumulative % of Bounce Rate per Page")
ad1 +theme(legend.position = "none")

## Informational
by_info1 <- shoppers %>%
  group_by(Informational)%>%
  summarise(Bounce = sum(BounceRates))

inf1 <- ggplot(by_info1,aes(x=factor(Informational),y=Bounce, fill=factor(Informational)))+geom_bar(stat = "identity")+ labs(title="Cumulative % of  Bounce Rate in Informational", x="Informational Pages",y=" Cumulative % of Bounce Rate per Page")

inf1+theme(legend.position = "none")

## Product Related
by_prod1 <- shoppers %>%
  group_by(ProductRelated)%>%
  summarise(Bounce = sum(BounceRates))

pro1 <- ggplot(by_prod1[1:20,],aes(x=reorder(factor(ProductRelated),Bounce),y=Bounce, fill=factor(ProductRelated)))+geom_bar(stat = "identity")+ labs(title="Cumulative % of  Bounce Rate in Product Related",subtitle = "(Plotted for the Top 20 Pages)" ,x="Product Related Pages",y="Cumulative % Bounce Rate per Page")
pro1+theme(legend.position = "none")

# Pages with the Highest Bounce rate
## Administrative
by_admin2 <- shoppers %>%
  group_by(Administrative)%>%
  summarise(Page = sum(PageValues))

ad2 <- ggplot(by_admin2,aes(x=factor(Administrative),y=Page, fill=factor(Administrative)))+geom_bar(stat = "identity")+ labs(title="Total Average for Page Values in Administrative", x="Administrative Pages",y=" Total Average for Page Values per Page")
ad2 +theme(legend.position = "none")

## Informational
by_info2 <- shoppers %>%
  group_by(Informational)%>%
  summarise(Page = sum(PageValues))

inf2 <- ggplot(by_info2,aes(x=factor(Informational),y=Page, fill=factor(Informational)))+geom_bar(stat = "identity")+ labs(title="Total Average for Page Values in Informational", x="Informational Pages",y=" Total Average for Page Values per Page")

inf2+theme(legend.position = "none")

## Product Related
by_prod2 <- shoppers %>%
  group_by(ProductRelated)%>%
  summarise(Page = sum(PageValues))

pro2 <- ggplot(by_prod2[1:20,],aes(x=reorder(factor(ProductRelated),Page),y=Page, fill=factor(ProductRelated)))+geom_bar(stat = "identity")+ labs(title="Total Average forvPage Values in Product Related", x="Product Related Pages",y="Total Average for Page Values per Page")
pro2+theme(legend.position = "none")

# Exit Rates over the Months
exit_months <- shoppers %>%
  group_by(Month)%>%
  summarise(exit = sum(ExitRates), bounce =sum(BounceRates), pages = sum(PageValues))


#Exit Rate
mon <- ggplot(exit_months,aes(x=reorder(factor(Month),exit),y=exit, fill=factor(Month)))+geom_bar(stat = "identity")+ labs(title="Total Exit Rate Per Month", x="Months",y="Cumulative % for Exit Rate")

mon+theme(legend.position = "none")

#Bounce Rates
mon1 <- ggplot(exit_months,aes(x=reorder(factor(Month),bounce),y=bounce, fill=factor(Month)))+geom_bar(stat = "identity")+ labs(title="Total Bounce Rate Per Month", x="Months",y="Cumulative % for Bounce Rate")

mon1 + theme(legend.position = "none")

# PageValues
mon2 <- ggplot(exit_months,aes(x=reorder(factor(Month),pages),y=pages, fill=factor(Month)))+geom_bar(stat = "identity")+ labs(title="Total Average Page Values Per Month", x="Months",y="Total Average for Page Values")
mon2 + theme(legend.position = "none")

From our Analysis we observe a few interesting things that Special Days are in February and May. The site is only active 10 out of 12 months of the year excluding January and April Popular moths are:- 1. May 2. March 3. November 4. December

The months with most revenue are:- 1. November 2. May 3. December

Although the site experiencies no revenue as compared to revenue, which is a case of imbalanced data that we will deal with when modelling

Interestingly February has special days yet it is not a popular month and also doesn’t generate a lot of revenue

The visitors who come to the site are mostly Returning customers who are seen mostly in May,November, March and December.

In December and November we also observe a rare group of visitors, Others who are only seen during htis two months Also December,November and March experience highest numbers in new visitors Therefore the Months:- 1. December 2. November 3. March 4. May

Should be the focus of the Sales and Marketing team, they are the months for most likely to retain its customers and also gain new customers

There is more activity during Weekdays compared to Weekends, most revenue being acquired on Weekdays compared to Weekends, but still there is significant activity with revenue in both Weekdays and Weekends

There are 3 Traffic Types the most popular is Type 2, it most active in November,March and May In May three Traffic Types are most popular and that is Type 2,3 and 4. In March Type 1 is very active

Our Data is not Normally Distributed. From our correlation table we see the Administrative and Administrative Duration, Informational and Informational Duration, and Product Related and Product Related Duration have strong positive correlation which are 0.77,0.94,0.85 respectively.

This means the more the User access the most hidden pages the more likely there are to stay on the site.

Revenue has a strong positive correlation with Page Values, a correlation of 0.6. An increase in Page Values the higher the revenue

Bounce Rates and Exit Rates also have a strong positive correlation of 0.78, where when a page has been visited it has a likely chance to “bounced” out on. Therefore we see a lot of the pages that had high number of visitors also had a high number of Bounces. This were pages:- 1. Page 0 - Administrative 2. Page 0 - Informational 3. Page 1,2,3,4 - Product Related

This are probably the first pages the user interacts with which might explain the lack of Revenue, therefore an area to focus on for the Sales and Marketing team although the Bounce Rates are less than Exit Rates the difference, however, is not that significant

The pages that are recording high values in Page Values are:- 1.Page 0 and 9 - Administrative 2. Page 0 - Informational 3. Page 19,15,13 - Product Related

The products being viewed are in those pages you cannot easily locate

Even though the Months March, May, November and December have more Page Values and therefore Revenue, the Months December,March and November experience very high numbers of Bounce Rates.

Implement the Solution

K-Means Clustering

# Transform Factors to Numeric
shoppers$Month <- as.numeric(shoppers$Month)
shoppers$VisitorType <- as.numeric(shoppers$VisitorType)
shoppers$Weekend <- as.numeric(shoppers$Weekend)

str(shoppers)
## 'data.frame':    12205 obs. of  18 variables:
##  $ Administrative         : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : num  1 2 1 2 10 19 1 1 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.146 0 0.146 0.146 0.02 ...
##  $ ExitRates              : num  0.171 0.171 0.171 0.171 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.6 0 0.6 0.6 ...
##  $ Month                  : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ OperatingSystems       : num  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : num  1 2 1 2 5 2 5 2 2 5 ...
##  $ Region                 : num  1 1 8 2 1 1 3 1 2 1 ...
##  $ TrafficType            : num  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : num  1 1 1 1 2 1 1 2 1 1 ...
##  $ Revenue                : num  1 1 1 1 1 1 1 1 1 1 ...
# K-Means is affected by Imbalanced data
table(shoppers$Revenue)
## 
##     1     2 
## 10297  1908
prop.table(table(shoppers$Revenue))
## 
##         1         2 
## 0.8436706 0.1563294
# our class label is imbalanced thus we shall balance it
# Split to train and test
library(ROSE)
## Loaded ROSE 0.0-3
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
# Make the Revenue Class a factor again
shoppers$Revenue <- as.factor(shoppers$Revenue)


# Use ROSE, it helps us to generate data synthetically as well
train.rose <- ROSE(Revenue ~ ., data = shoppers, seed = 1)$data
table(train.rose$Revenue)
## 
##    1    2 
## 6149 6056
feature <- train.rose[, -18]
label <- train.rose[,"Revenue"]
# Normalize the data
normalize <- function(x){
  return ((x-min(x)) / (max(x)-min(x)))
}

# normalize the train
feat_norm <- as.data.frame(lapply(feature, normalize))


summary(feat_norm)
##  Administrative   Administrative_Duration Informational   
##  Min.   :0.0000   Min.   :0.0000          Min.   :0.0000  
##  1st Qu.:0.2802   1st Qu.:0.3304          1st Qu.:0.2940  
##  Median :0.3723   Median :0.4028          Median :0.3723  
##  Mean   :0.4056   Mean   :0.4348          Mean   :0.4198  
##  3rd Qu.:0.5083   3rd Qu.:0.5025          3rd Qu.:0.5338  
##  Max.   :1.0000   Max.   :1.0000          Max.   :1.0000  
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :0.0000         Min.   :0.0000   Min.   :0.0000         
##  1st Qu.:0.2982         1st Qu.:0.3269   1st Qu.:0.3086         
##  Median :0.3723         Median :0.3980   Median :0.3833         
##  Mean   :0.4170         Mean   :0.4258   Mean   :0.4133         
##  3rd Qu.:0.5030         3rd Qu.:0.4949   3rd Qu.:0.4862         
##  Max.   :1.0000         Max.   :1.0000   Max.   :1.0000         
##   BounceRates       ExitRates        PageValues       SpecialDay    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2771   1st Qu.:0.3056   1st Qu.:0.3040   1st Qu.:0.2505  
##  Median :0.3139   Median :0.3472   Median :0.4028   Median :0.2962  
##  Mean   :0.3418   Mean   :0.3771   Mean   :0.4617   Mean   :0.3213  
##  3rd Qu.:0.3597   3rd Qu.:0.4066   3rd Qu.:0.6314   3rd Qu.:0.3465  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      Month        OperatingSystems    Browser           Region      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.4692   1st Qu.:0.3374   1st Qu.:0.3129   1st Qu.:0.3964  
##  Median :0.5832   Median :0.4264   Median :0.3911   Median :0.4759  
##  Mean   :0.5570   Mean   :0.4334   Mean   :0.4222   Mean   :0.5012  
##  3rd Qu.:0.6663   3rd Qu.:0.5211   3rd Qu.:0.4868   3rd Qu.:0.5891  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   TrafficType      VisitorType        Weekend      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2700   1st Qu.:0.5608   1st Qu.:0.2909  
##  Median :0.3404   Median :0.6522   Median :0.3656  
##  Mean   :0.3767   Mean   :0.6097   Mean   :0.4108  
##  3rd Qu.:0.4395   3rd Qu.:0.7164   3rd Qu.:0.5130  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
#Apply the model
model <- kmeans(feat_norm,2, nstart = 30)

# check for no. of records in each cluster
model$size
## [1] 3677 8528
# a set of initial (distinct) cluster centers.
model$centers
##   Administrative Administrative_Duration Informational Informational_Duration
## 1      0.5274303               0.5339308     0.6159815              0.6020258
## 2      0.3530291               0.3920868     0.3352657              0.3372309
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1      0.5201412               0.5135446   0.3138185 0.3376627  0.5351084
## 2      0.3851206               0.3701311   0.3538725 0.3941598  0.4300299
##   SpecialDay     Month OperatingSystems   Browser    Region TrafficType
## 1  0.3083943 0.5754530        0.4325012 0.4150062 0.4970129   0.3713982
## 2  0.3268287 0.5490706        0.4337791 0.4253063 0.5029515   0.3789528
##   VisitorType   Weekend
## 1   0.6346745 0.4226689
## 2   0.5989623 0.4056178
# shows the cluster where each record falls
#model$cluster

#Verfy the results
plot(feat_norm[c(1:2)], col = model$cluster)

plot(feat_norm[c(1:2)], col = label)

plot(feat_norm[c(3:4)], col = model$cluster)

plot(feat_norm[c(3:4)], col = label)

plot(feat_norm[c(5:6)], col = model$cluster)

plot(feat_norm[c(5:6)], col = label)

tb <- table(model$cluster, label)
tb
##    label
##        1    2
##   1 1265 2412
##   2 4884 3644
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
paste0("Accuracy: ",accuracy(tb))
## [1] "Accuracy: 40.221220811143"

Hierarchial Clustering

# Scale our data first
data <- shoppers[,-18]

data <- scale(data)
head(data)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1      -0.768773              -0.6425037    -0.5240717             -0.4952383
## 2      -0.768773              -0.6425037    -0.5240717             -0.4952383
## 3      -0.768773              -0.6425037    -0.5240717             -0.4952383
## 4      -0.768773              -0.6425037    -0.5240717             -0.4952383
## 5      -0.768773              -0.6425037    -0.5240717             -0.4952383
## 6      -0.768773              -0.6425037    -0.5240717             -0.4952383
##   ProductRelated ProductRelated_Duration BounceRates  ExitRates PageValues
## 1     -0.9237940              -0.8835210  2.69534563  2.6456148 -0.5367523
## 2     -0.8909011              -0.8308182 -0.49740887  2.6456148 -0.5367523
## 3     -0.9237940              -0.8835210  2.69534563  2.6456148 -0.5367523
## 4     -0.8909011              -0.8813250  2.69534563  2.6456148 -0.5367523
## 5     -0.6277577              -0.3667864 -0.05975975  0.1422596 -0.5367523
## 6     -0.3317213              -0.7565264 -0.15189640 -0.3845030 -0.5367523
##   SpecialDay     Month OperatingSystems    Browser     Region TrafficType
## 1 -0.3376272 -1.334436       -1.4324443 -1.0259045 -0.9150294 -0.78367014
## 2 -0.3376272 -1.334436       -0.1128743 -0.2161545 -0.9150294 -0.54236901
## 3 -0.3376272 -1.334436        2.5262657 -1.0259045  2.1178514 -0.30106789
## 4 -0.3376272 -1.334436        1.2066957 -0.2161545 -0.4817607 -0.05976676
## 5 -0.3376272 -1.334436        1.2066957  2.2130955 -0.9150294 -0.05976676
## 6 -0.3376272 -1.334436       -0.1128743 -0.2161545 -0.9150294 -0.30106789
##   VisitorType    Weekend
## 1   0.4096534 -0.5530653
## 2   0.4096534 -0.5530653
## 3   0.4096534 -0.5530653
## 4   0.4096534 -0.5530653
## 5   0.4096534  1.8079567
## 6   0.4096534 -0.5530653
#calculate the Euclidean distance
d <- dist(data, method = "euclidean")

# apply the h-clustering and use different methods
model_comp <- hclust(d, method = "complete")

model_ward <- hclust(d, method = "ward.D")
model_ward2 <- hclust(d, method = "ward.D2")
model_sing <- hclust(d, method = "single")
model_avg <- hclust(d, method = "average")
model_mc <- hclust(d, method = "mcquitty")
model_med <- hclust(d, method = "median")
model_cent <- hclust(d , method = "centroid")

# Plot Dendograms

plot(model_comp, cex = 0.6, hang = -1)

plot(model_ward, cex = 0.6, hang = -1)

plot(model_ward2, cex = 0.6, hang = -1)

plot(model_sing, cex = 0.6, hang = -1)

plot(model_avg, cex = 0.6, hang = -1)

plot(model_mc, cex = 0.6, hang = -1)

plot(model_med, cex = 0.6, hang = -1)

plot(model_cent, cex = 0.6, hang =-1)

The two Clustering methods, K-Means Clustering achieved an accuracy of 39% while for the Hierarchical Clustering the model that uses the ward.D2 method best clusters. It forms clusters better than K-Means because we can see it forming a finally branch on the dendogram

The challenge with Hierarchical Clustering you cannot view the variables and how they formed the cluster, also can measure it performance and compare it with K-Means.

In terms of being able to help the Client understand their Customers’ behaviors we can use the Hierarchical Clustering although we need another model to measure on performance and classify a customer as likely to generate Revenue or not

Challenge the Solution

From the two Algorithm we can see it there were not sufficient in giving us performance, K-Means performed poorly while Hierarchical Clustering was able to form clusters but there is no way to measure performance

Therefore I believe we can explore on other Unsupervised Machine Learning Models like Neural Networks